AI Trained on Evolution’s Playbook Develops Proteins that Spur Drug and Scientific Discovery

September 25, 2024 • by Marc Airhart

EvoRank offers a new and tangible example of how AI may help bring disruptive change to biomedical research and biotechnology more broadly.

A colorful ribbon with elaborate twirls and twists represents the three-dimensional shape of a molecule

Using the MutRank framework trained with EvoRank, Danny Diaz and professor Andrew Ellington’s team are developing an improved version of a protein critical for the biomanufacturing of mRNA therapeutics and vaccines. In this example, the model recommends keeping the blue parts the same as the natural version of the protein and strongly considering mutating the red parts. Credit: Danny Diaz/University of Texas at Austin.


A new artificial intelligence model developed by researchers at The University of Texas at Austin paves the way for more effective and less toxic treatments and new preventive strategies in medicine. The AI model informs the design of protein-based therapies and vaccines by leveraging the underlying logic from nature’s evolutionary processes.

The AI advance, called EvoRank, offers a new and tangible example of how AI may help bring disruptive change to biomedical research and biotechnology more broadly. Scientists described the work at the International Conference on Machine Learning and published a related paper in Nature Communications about leveraging a broader AI framework to identify useful mutations in proteins. 

A major obstacle to designing better protein-based biotechnologies is having enough experimental data about proteins to adequately train AI models to understand how specific proteins work and thus how to engineer them for specific purposes. The key insight with EvoRank is to harness the natural variations of millions of proteins generated by evolution over deep time and extract the underlying dynamics needed for workable solutions to biotech challenges.

“Nature has been evolving proteins for 3 billion years, mutating or swapping out amino acids and keeping those that benefit living things,” said Daniel Diaz, a research scientist in computer science and co-lead of the Deep Proteins group, an interdisciplinary team of computer science and chemistry experts at UT. “EvoRank learns how to rank the evolution that we observe around us, to essentially distill the principles that determine protein evolution and to use those principles so they can guide the development of new protein-based applications, including for drug development and vaccines, as well as a wide range of biomanufacturing purposes.”

UT is home to one of the leading programs in the country for AI research and houses the National Science Foundation-funded Institute for Foundations of Machine Learning (IFML) led by computer science professor Adam Klivans, who also co-leads Deep Proteins. Today, the Advanced Research Projects Agency for Health announced a grant award involving Deep Proteins and vaccine-maker Jason McLellan, a UT professor of molecular biosciences, in collaboration with the La Jolla Institute for Immunology. The UT team will receive nearly $2.5 million to begin to apply AI in protein engineering research into developing vaccines to fight herpesviruses.  

“Engineering proteins with capabilities that natural proteins do not have is a recurring grand challenge in the life sciences,” Klivans said. “It also happens to be the type of task that generative AI models are made for, as they can synthesize large databases of known biochemistry and then generate new designs.” 

Unlike Google DeepMind’s AlphaFold, which applies AI to predict the shape and structure of proteins based on each one’s sequence of amino acids, the Deep Proteins group’s AI systems suggest how best to make alterations in proteins for specific functions, such as improving the ease with which a protein can be developed into new biotechnologies. 

McLellan’s lab is already synthesizing different versions of viral proteins based on AI-generated designs, then testing their stability and other properties. 

“The models have come up with substitutions we never would have thought of,” McLellan said. “They work, but they aren’t things we would have predicted, so they’re actually finding some new space for stabilizing.”

Protein therapeutics often have fewer side effects and can be safer and more effective than the alternatives, and the estimated $400 billion global industry today is primed to grow more than 50% during the next decade. Still, developing a protein-based drug is slow, costly and risky. An estimated $1 billion or more is needed for the decade-plus journey from drug design to completing clinical trials; even then, the odds of securing approval from the Food and Drug Administration for a company’s new drug are only about 1 in 10. What’s more, to be useful in therapeutics, proteins often need to be genetically engineered, for example, to ensure their stability or to allow them to yield at a level needed for drug development—and cumbersome trial-and-error in labs traditionally has dictated such genetic engineering decisions. 

If EvoRank and Stability Oracle, the related UT-created framework on which it builds, are commercially adapted, industry would have opportunities to shave time and expense from drug development, with a road map to arrive at better designs faster. 

A spreadsheet with about a dozen rows and a couple of dozen columns. Each field contains a letter that represents an amino acid in a sequence.

Each row in this chart represents the same protein in a different organism, each with its own subtle tweaks. Each letter represents a different amino acid in a sequence (e.g., H = histidine). Researchers can look at one position in the amino acid sequence (e.g., the column outlined in red) and see how often evolution selected a particular amino acid for that position in the protein. Credit: University of Texas at Austin.

Using existing databases of naturally occurring protein sequences, the researchers who created EvoRank essentially lined up different versions of the same protein that appear in different organisms, from starfish to oak trees to humans, and compared them. At any given position in the protein, there might be one of several different amino acids that evolution has found to be useful, with nature selecting, say, tyrosine 36% of the time, histidine 29% of the time and lysine 14% of the time, and, even more importantly, never leucine. This gold mine of existing data reveals an underlying logic in protein evolution. Researchers can rule out options that, evolution suggests, would destroy the protein’s functionality. The team uses all of this to train the new machine learning algorithm. Based on continuous feedback, the model learns which amino acids nature opted for in the past when evolving proteins, and it bases its understanding on what’s plausible in nature and what is not. 
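
The column-by-column comparison described above can be illustrated with a short Python sketch. This is not the EvoRank model or its training code. It only shows the counting step, that is, how often each amino acid appears at one position of an alignment and which amino acids never appear there. The toy alignment, the function names and the choice to skip gap characters are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy alignment (not real data): each string stands for the
# same protein region taken from a different organism, already aligned.
ALIGNMENT = [
    "MKYLV", "MKYLV", "MRYLI", "MKYIV", "MKHLV",
    "MRHLV", "MKHLI", "MKKLV", "MKKIV", "MKRLV",
]

# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(alignment, position):
    """Return the fraction of sequences using each amino acid at `position`.

    Gap characters ("-") are skipped, which is an assumption of this sketch.
    The result is ordered from most to least common.
    """
    column = [seq[position] for seq in alignment if seq[position] != "-"]
    counts = Counter(column)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {aa: n / total for aa, n in counts.most_common()}


def unseen_residues(alignment, position):
    """Return the amino acids that never occur at `position` in the alignment."""
    seen = {seq[position] for seq in alignment}
    return sorted(set(AMINO_ACIDS) - seen)


if __name__ == "__main__":
    freqs = column_frequencies(ALIGNMENT, 2)
    for aa, frac in freqs.items():
        print(f"{aa}: {frac:.0%}")
    print("never observed:", "".join(unseen_residues(ALIGNMENT, 2)))
```

In this toy column, tyrosine (Y) is the most common choice and leucine (L) never appears, mirroring the kind of pattern the article describes. A learned model would go well beyond raw counts, but the counts show what signal the alignment carries.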

Diaz next plans to develop a “multicolumn” version of EvoRank that can evaluate how multiple mutations at the same time affect a protein’s structure and stability. He also wants to build new tools for predicting how a protein’s structure relates to its function.

Besides Klivans and Diaz, computer science graduate student Chengyue Gong and UT alumnus James M. Loy co-authored both works. Tianlong Chen and Qiang Liu also contributed to EvoRank; Jeffrey Ouyang-Zhang, David Yang, Andrew D. Ellington and Alex G. Dimakis additionally contributed to Stability Oracle. The research was funded by the NSF, the Defense Threat Reduction Agency and The Welch Foundation.

Read the companion feature story about Diaz and this work.
